Empirical Methods for Exploiting Parallel Texts

Author

  • I. Dan Melamed
Abstract

Parallel translations of written texts have long been useful tools for human students of language, and have begun to serve as an intriguing source of data for corpus-based approaches to natural language processing. A source text and its translation can be viewed as a coarse map between the two languages, and an industrious student or clever computer program may wish to refine that mapping so that it shows which sentences, phrases, and words are translations of one another. Humans are very adept at finding such relations in parallel text. This is true even when one or both of the languages are unfamiliar, as can be seen in a simple but convincing exercise in Knight (1997). While there was considerable early success in automatically identifying sentences in parallel text that are translations of each other (e.g., Brown, Lai, and Mercer, 1991; Gale and Church, 1993), a variety of challenging problems has emerged since that time. Empirical Methods for Exploiting Parallel Texts is a revision of the author’s 1998 Ph.D. dissertation (University of Pennsylvania), and succeeds in capturing the range of problems inherent in parallel text. It presents a variety of techniques for finding translation equivalents and demonstrates that once these are available they can be used to align text segments, detect omissions in translations, identify non-compositional compounds, and discriminate among word senses.


Similar resources

Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems

Data used for training statistical machine translation methods are usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...


Empirical Methods for MT Lexicon Development

This article reviews some recently invented methods for automatically extracting translation lexicons from parallel texts. The accuracy of these methods has been significantly improved by exploiting known properties of parallel texts and of particular language pairs. The state of the art has advanced to the point where translations can be found automatically and with high reliability even for no...


Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study

A central problem of word sense disambiguation (WSD) is the lack of manually sense-tagged data required for supervised learning. In this paper, we evaluate an approach to automatically acquire sense-tagged training data from English-Chinese parallel corpora, which are then used for disambiguating the nouns in the SENSEVAL-2 English lexical sample task. Our investigation reveals that this method ...


Mining parallel fragments from comparable texts

This paper proposes a novel method for exploiting comparable documents to generate parallel data for machine translation. First, each source document is paired to each sentence of the corresponding target document; second, partial phrase alignments are computed within the paired texts; finally, fragment pairs across linked phrase-pairs are extracted. The algorithm has been tested on two recent ...


Publication date: 2006